Abstract

Students' natural language (NL) explanations in the domain of
qualitative mechanics lie in between unrestricted NL and the
constrained NL of “proper” domain statements. Analyzing such
input and providing appropriate tutorial feedback requires
extracting information relevant to the physics domain and
diagnosing this information for possible errors and gaps in
reasoning. In this paper we describe two approaches to solving
the diagnosis problem: weighted abductive reasoning and an
assumption-based truth maintenance system (ATMS). We also
outline the features of a knowledge representation (KR) designed
to capture the relevant semantics and to keep reasoning
computationally feasible.

Introduction 

One of the hypotheses behind the creation of NL-based intelligent
tutoring systems is that allowing students to provide
unrestricted input to a system would trigger meta-cognitive
processes that support learning (i.e. self-explaining (Chi et
al. 1994)) and help expose misconceptions (Slotta, Chi, &
Joram 1995). WHY2-ATLAS is such a tutoring system. It is
designed to elicit explanations in the domain of qualitative
physics (VanLehn et al. 2002).

A typical problem and a student explanation are shown in
Figure 1. An example of an incorrect explanation is shown in
Figure 2. As can be seen from the examples, a student’s
explanation about a formal domain such as qualitative physics
may involve a number of phenomena: algebraic formulas, NL
renderings of formulas, various degrees of formality, and
conveying the logical structure of an argument (Makatchev et
al. 2005). Tutoring goals involve eliciting correct statements
of the appropriate degree of formality and their justifications
to address possible gaps and errors in the explanation. To
achieve these goals the NL understanding component is required
to answer the following questions:

•  Does the student explanation contain errors? If yes, what
are the likely buggy assumptions that have led the student
to these errors?

•  What required statements have not been covered by the
student? Does the explanation contain statements that are
logically close to the required statements?

Problem: A heavy clay ball and a light clay ball are released in
a vacuum from the same height at the same time. Which reaches
the ground first? Explain.

Explanation: Both balls will hit at the same time. The only force
acting on them is gravity because nothing touches them. The
net force, then, is equal to the gravitational force. They have
the same acceleration, g, because gravitational force=mass*g
and f=ma, despite having different masses and net forces. If
they have the same acceleration and same initial velocity of 0,
they have the same final velocity because
acceleration=(final-initial velocity)/elapsed time. If they have
the same acceleration, final, and initial velocities, they have
the same average velocity. They have the same displacement
because average velocity=displacement/time. The balls will
travel together until the reach the ground.

Figure 1: The statement of the problem and a verbatim student
explanation.

These  requirements  imply  that  a  logical  structure  needs 
to  be  imposed  on  the  space  of  possible  domain  statements. 
Considering  such  a  structure  to  be  a  model  of  the  stu¬ 
dent’s  reasoning  about  the  domain,  the  two  requirements 
correspond  to  a  solution  of  a  model-based  diagnosis  prob¬ 
lem  (Forbus  &  de  Kleer  1993). 

How does one build such a model? A desire to make the process
scalable and feasible necessitates an automated procedure. The
difficulty is that this automated reasoner would have to deal
with the NL phenomena that are relevant for our application. In
turn, this means that the KR would have to be able to express
these phenomena, which, as mentioned above, include algebraic
formulas, NL renderings of formulas, various degrees of
formality, and logical structure. The reasoner would have to
account for common reasoning fallacies, have flexible
consistency constraints, and perform within the tight
requirements of a real-time dialogue application.

The heavy clay ball and light clay ball will never reach the
ground in a vacuum. A vacuum has no air or gravity so neither
ball will ever touch the ground.

Figure 2: Another verbatim student explanation.

In the next section we describe a KR that attempts to satisfy
the expressivity and efficiency requirements. In the section on
reasoning we present two approaches for building models of a
student’s reasoning: first, an on-the-fly approach based on
weighted abductive theorem proving, and second, an approach
based on a precomputed ATMS.

Knowledge  representation 

We have chosen an order-sorted first-order predicate logic
(FOPL) as a base KR for our domain since it is expressive
enough to reflect the hierarchy of concepts from the qualitative
mechanics ontology (Ploetzner & VanLehn 1997) and has a
straightforward proof theory (Walther 1987). Following the
representation used in the abductive reasoner Tacitus-lite+
(Thomason, Hobbs, & Moore 1996), our KR is function-free and has
no quantifiers, Skolem constants, or explicit negation. Instead,
all variables in facts or goals are assumed to be existentially
quantified, and all variables in rules are either universally
quantified (if they appear in premises) or existentially
quantified (if they appear in conclusions only).

The first argument of a predicate defines the arity and
constraints on the sort of each argument. Possible values for
the first argument are: one-body vectors, i.e. position,
displacement, velocity, acceleration; a two-body vector force;
distance; state; etc. Since the first argument defines the
syntax of a predicate, in the rest of this paper we will call
the predicate and the respective atom by the name of its first
argument (and will omit a predicate symbol preceding the list of
arguments). The second argument of a predicate is normally a
unique atom identifier used for cross-referencing. Most of the
predicates end with two time arguments specifying time points
and open intervals.
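
To make this convention concrete, the following Python sketch stores an atom as a name (the first argument) plus its remaining arguments, and checks arity against a signature table. It is only an illustration and not the WHY2-ATLAS implementation: the SIGNATURES table, the Atom class and the helper names are assumptions of this example, and sort checking is omitted. The two arities shown are read off the math-form and dependency forms used in this paper.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical signature table: atom name -> number of arguments that follow
# the name (the atom identifier is counted). Illustrative, not exhaustive.
SIGNATURES = {
    "math-form": 2,    # identifier, equation label
    "dependency": 6,   # identifier, var1, var2, relation, t1, t2
}


@dataclass(frozen=True)
class Atom:
    name: str               # first argument: names the predicate, fixes the arity
    args: Tuple[str, ...]   # identifier followed by the remaining arguments

    def __post_init__(self):
        expected = SIGNATURES.get(self.name)
        if expected is not None and len(self.args) != expected:
            raise ValueError(
                f"{self.name} expects {expected} arguments, got {len(self.args)}"
            )

    @property
    def identifier(self) -> str:
        # Second argument of the predicate: the unique atom identifier.
        return self.args[0]


def is_variable(term: str) -> bool:
    # Variables are written with a leading question mark, e.g. ?t1.
    return term.startswith("?")


def parse_atom(text: str) -> Atom:
    # Parse a flat, function-free atom such as "(math-form mf1 label)".
    tokens = text.strip().strip("()").split()
    return Atom(tokens[0], tuple(tokens[1:]))
```

Because the KR is function-free, a flat token list is enough here; no nested terms need to be parsed.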

Although our KR has no explicit negation, some types of negative
statements are represented by using (a) complementary sorts, for
example constant and nonconstant; (b) a relative position
predicate, namely (rel-position rp1 nonequal ...); or (c) a
comparison of quantities predicate
(compare c1 ?var1 ?var2 ... ?diff), where the ?diff argument is
of sort nonzero to denote inequality.

Below  we  describe  the  representations  for  each  of  the  rel¬ 
evant  NL  phenomena  mentioned  in  the  introduction. 

Algebraic  formulas  and  NL  rendering  of  formulas 

Most of the NL expressions that the system must process do not
contain direct references to formulas (e.g. Figure 2). However,
those utterances that do use algebraic expressions are usually
highly informative for the tutor and must be identified.
Examples of such expressions include “acceleration is final
velocity minus initial velocity over elapsed time”, “9.8 m/s^2”,
“a = 9.8 m/s^2”, and “the equation <net force = m * a>”. Instead
of parsing arbitrary algebraic expressions, an equation
identifier attempts shallow parsing of equation candidates and
maps them into a finite set of anticipated equation labels
(Makatchev et al. 2005), producing a representation of the form
(math-form mf1 <equation label>).
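
The idea of mapping equation candidates onto a finite label set can be sketched as follows. This is a minimal illustration under stated assumptions, not the equation identifier of (Makatchev et al. 2005): the normalization steps and the regular expressions are invented for this example, and of the two labels only avgv-is-vf-plus-vi-over-2 appears in this paper, while f-is-m-times-a is hypothetical.

```python
import re
from typing import Optional

# Hypothetical pattern table: (regular expression over normalized text, label).
EQUATION_PATTERNS = [
    (re.compile(r"=\(?vf\+vi\)?/2"), "avgv-is-vf-plus-vi-over-2"),
    # The lookahead keeps "f=mass*g" from being read as "f=ma".
    (re.compile(r"(netforce|force|f)=m\*?a(?![a-z])"), "f-is-m-times-a"),
]


def normalize(candidate: str) -> str:
    # Lowercase, map a few NL renderings to symbols, and drop whitespace.
    text = candidate.lower()
    text = re.sub(r"\b(is|equals)\b", "=", text)
    text = re.sub(r"\btimes\b", "*", text)
    return re.sub(r"\s+", "", text)


def identify_equation(candidate: str, atom_id: str) -> Optional[str]:
    # Return a math-form atom for the first matching label, or None.
    text = normalize(candidate)
    for pattern, label in EQUATION_PATTERNS:
        if pattern.search(text):
            return f"(math-form {atom_id} {label})"
    return None
```

A shallow matcher of this kind deliberately accepts loosely written input such as “vf+vi/2”; anything outside the anticipated set simply produces no math-form atom.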

In  addition  to  the  anticipated  equations,  there  are  repre¬ 
sentations  for  particular  relationships  between  two  variables: 

•  dependency: (dependency d1 ?var1 ?var2 ?relation ?t1 ?t2),
where d1 is an atom identifier, ?var1 and ?var2 are arbitrary
variables; the ?relation argument can be of sorts dependent and
independent, where the dependent sort has values proportional
and inversely-proportional; and ?t1 and ?t2 are time interval
variables.

•  comparison: (compare c1 ?var1 ?var2 ?order ?ratio ?diff),
where c1 is an atom identifier, ?var1 and ?var2 are arbitrary
variables, ?order specifies the order of the variables in the
binary relation, the ?ratio argument has possible values
relevant to ratios (one, greater-than-one, two), and ?diff is an
argument used for difference comparisons.

Various  degrees  of  formality 

NL  understanding  needs  to  distinguish  formal  versus  in¬ 
formal  physics  expressions  so  that  the  tutoring  system  can 
coach  on  proper  use  of  terminology.  Many  qualitative  me¬ 
chanics  phenomena  may  be  described  in  a  relatively  infor¬ 
mal  language,  for  example  “speed  up”  instead  of  “acceler¬ 
ate”  and  “push”  instead  of  “apply  a  force.”  The  relevant 
informal  expressions  fall  into  the  following  categories: 

•  relative  position:  “keys  are  behind  (in  front  of,  above,  un¬ 
der,  close,  far  from,  etc.)  man” 

•  motion:  “move  slower,”  “slow  down,”  “moves  along  a 
straight  line” 

•  dependency:  “horizontal  speed  will  not  depend  on  the 
force” 

•  direction:  “the  force  is  downward” 

•  interaction:  “the  man  pushes  the  pumpkin,”  “the  gravity 
pulls  the  ball” 

Each of these categories (except for the last one) has a
dedicated representation:

•  (rel-position rp1 ?rel-location ?body1 ?body2 ?t1 ?t2),
where the ?rel-location argument assumes values of in-front-of,
behind, above, below, at, closer, farther, close, far, etc.

•  (motion m1 ?body ?component ?traj-shape ?traj-speed ?d-mag
... ?t1 ?t2), where ?component specifies the vertical or
horizontal component of a vector (or none); the ?traj-shape
argument can be of sort linear or assume values of sort
curvilinear, namely parabolic, circle, ellipse; ?traj-speed
specifies the speed as either uniform or nonuniform; ?d-mag
(the derivative of a magnitude) specifies whether the
commonsense rate of the motion is increasing (“moves faster”)
or decreasing (“slows down”).

•  (dependency ... ?relation ...), where the ?relation argument
can be of sorts dependent or independent.


Step | Statement | Justification
1 | Both balls are near earth | Unless the problem says otherwise, assume objects are near earth
2 | Both balls have a gravitational force on them due to the earth | If an object is near earth, it has a gravitational force on it due to the earth
6 | Gravitational force is w = m*g for each ball | The force of gravity on an object has a magnitude of its mass times g, where g is the gravitational acceleration
18 | The balls have the same initial vertical position | given
19 | The balls have the same vertical position at all times | [Displacement = difference in position], so if the initial positions of two objects are the same and their displacements are the same, then so is their final position
20 | The balls reach the ground at the same time |

Figure 3: A fragment of an informal proof for the Clay Balls problem. The required points are in bold.


•  Qualitative  direction  constants  up,  down,  left,  right 
can  be  put  in  correspondence  with  otherwise  quantitative 
direction  variables  in  vector  predicates  via  reasoner  rules 
that  convert  formal  directions  into  informal  ones. 

•  While representing push and pull expressions via a dedicated
predicate seems straightforward, we are still assessing the
utility of distinguishing “man pushes the pumpkin” and “man
applies a force on the pumpkin” for our tutoring application and
currently represent both expressions as a nonzero force applied
by the man to the pumpkin.

These representations are generated by a combination of
symbolic and statistical NLP (Jordan, Makatchev, & VanLehn
2004).

Logical  structure 

One  of  the  tutoring  objectives  of  WHY2-ATLAS  is  to  en¬ 
courage  students  to  provide  argumentative  support  for  their 
conclusions.  This  requires  recognizing  and  representing 
the  justification-conclusion  clauses  in  student  explanations. 
Recognizing  such  clauses  is  a  challenging  NLP  problem  due 
to  the  issue  of  quantifier  and  causality  scoping.  It  is  also 
difficult  to  achieve  a  compromise  between  two  competing 
requirements  for  a  suitable  representation.  First,  the  KR 
should  be  flexible  enough  to  account  for  a  variable  number 
of  justifications.  Second,  reasoning  with  the  KR  should  be 
computationally  feasible. 

The second requirement eliminates the representation of the
relation between N premises and 1 conclusion via N atoms
(relation c1 premise<i> conclusion1), i = 1, ..., N. Indeed,
cross-referencing between atoms via shared variables is a
necessary but expensive feature. The best known algorithm for
matching cross-referenced atoms has time complexity O(2^n n^3),
where n is the number of input atoms (Shearer, Bunke, &
Venkatesh 2001).

Another possibility is using variable arity predicates of the
form (cause c1 cause1 cause2 cause3 ... causeN effect1). This
would require customizing our reasoners to allow soft matching
between variable arity predicates, which leads to an increase in
search space, similar to the case with N “short” atoms described
in the previous paragraph.


Another nuance of justification-conclusion representation is
asserting versus non-asserting conditions. Consider the
following examples: “if there was air resistance, the larger
ball would fall faster,” and “since there is no air resistance,
the balls fall at the same speed.” Clearly there is a difference
in the speaker’s belief about whether the condition actually
holds or not. The logical structures of these sentences can be
represented as A → B and C ∧ (C → D) respectively. Due to the
combined difficulties of NLP, KR and reasoning, currently we
largely ignore the justification-conclusion clues found in the
student’s input and represent both cases as conjunctions of
atoms: A ∧ B and C ∧ D respectively.

Rule  base 

The rules of qualitative mechanics, likely buggy student
inferences, and conversions between formal and informal
statements are represented as extended Horn clauses; that is,
the head of a rule is either a single atom or a conjunction of
multiple atoms.
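
A rule with a conjunctive head can be pictured with the following propositional sketch. It is a simplification made for illustration only: the actual rules range over sorted atoms with variables, and the Rule type, the buggy flag and the proposition names below are assumptions of this example.

```python
from typing import FrozenSet, Iterable, List, NamedTuple, Set, Tuple


class Rule(NamedTuple):
    premises: FrozenSet[str]   # all premises must hold for the rule to fire
    heads: Tuple[str, ...]     # extended Horn clause: one or more conclusions
    buggy: bool = False        # marks a likely buggy student inference


def forward_chain(facts: Iterable[str], rules: List[Rule],
                  allow_buggy: bool = True) -> Set[str]:
    # Apply rules until no rule adds a new proposition (a fixpoint).
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            if rule.buggy and not allow_buggy:
                continue
            if rule.premises <= derived and not set(rule.heads) <= derived:
                derived.update(rule.heads)
                changed = True
    return derived


# Illustrative rules loosely following the Clay Balls problem.
RULES = [
    Rule(frozenset({"near-earth"}), ("gravity-acts",)),
    Rule(frozenset({"gravity-acts", "nothing-touches"}),
         ("only-force-is-gravity", "net-force-is-gravity")),
    Rule(frozenset({"heavier"}), ("falls-faster",), buggy=True),
]
```

Calling forward_chain with allow_buggy=False yields only the consequences of correct rules, which is one simple way to separate correct from buggy derivations in such a toy setting.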

Two different approaches to diagnosing student input have been
implemented during different stages of the project: abductive
reasoning and ATMS.

Abductive  diagnosis 

In the first prototype of the WHY2-ATLAS tutoring system the
task of diagnosing student input was performed on-the-fly by the
Tacitus-lite+ abductive theorem prover (Makatchev, Jordan, &
VanLehn 2004). The advantages of computing the diagnosis
on-the-fly as an abductively generated proof that minimizes the
total cost of assumptions are as follows:

•  An  abductive  proof  of  an  observed  student  input  expres¬ 
sion  can  be  generated  even  if  the  rule  base  is  incomplete 
and  the  assumptions  are  not  present  as  givens  (everything 
is  assumable). 

•  The  real-time  process  of  proof  generation  can  be  halted  at 
any  time  and  the  cheapest  proof  generated  so  far  can  be 
taken  as  the  best  current  approximation  of  the  diagnosis. 
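
The anytime behaviour can be illustrated with a small propositional sketch. It is not Tacitus-lite+: the rule format, the fixed per-proposition assumption costs (with a default of 1.0), and the step budget standing in for a wall-clock deadline are all assumptions of this example, and the consistency checking that a full abductive prover needs is omitted.

```python
import heapq
from typing import Dict, FrozenSet, List, Sequence, Set, Tuple


def cheapest_proof(goals: Sequence[str], givens: Set[str],
                   rules: Dict[str, List[List[str]]],
                   assume_cost: Dict[str, float],
                   max_steps: int) -> Tuple[float, FrozenSet[str]]:
    """Best-first search for a low-cost abductive proof of the goals.

    rules maps a head to alternative premise lists. Returns the cheapest
    complete proof found within max_steps as (cost, assumed propositions).
    """
    def cost_of(prop: str) -> float:
        return assume_cost.get(prop, 1.0)

    # Everything is assumable, so a proof always exists: assume every goal.
    best_cost = sum(cost_of(g) for g in goals)
    best_assumed = frozenset(goals)

    # Entries: (cost so far, tie-breaker, open goals, assumptions made).
    frontier = [(0.0, 0, tuple(goals), frozenset())]
    counter = 1
    steps = 0
    while frontier and steps < max_steps:
        cost, _, open_goals, assumed = heapq.heappop(frontier)
        steps += 1
        if cost >= best_cost:
            continue                      # cannot beat the current best proof
        if not open_goals:
            best_cost, best_assumed = cost, assumed
            continue
        goal, rest = open_goals[0], open_goals[1:]
        successors = []
        if goal in givens or goal in assumed:
            successors.append((cost, rest, assumed))          # already holds
        else:
            successors.append((cost + cost_of(goal), rest, assumed | {goal}))
            for premises in rules.get(goal, []):              # backchain
                successors.append((cost, tuple(premises) + rest, assumed))
        for c, g, a in successors:
            heapq.heappush(frontier, (c, counter, g, a))
            counter += 1
    return best_cost, best_assumed
```

Halting early (a small max_steps) simply returns a more expensive proof, for instance one that assumes the observed statement outright; a larger budget lets the search replace it with cheaper assumptions deeper in the rule base.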

Despite  these  advantages,  this  approach  had  two  major 
drawbacks: 


•  The  inherent  unsoundness  of  abductive  inference  requires 
a  costly  procedure  to  ensure  that  a  proof  is  consistent  (no 
two  mutually  exclusive  subgoals  are  produced). 

•  There is no estimate of the time necessary to find a proof
below an acceptable cost threshold.

As  a  result,  the  average  time  required  by  the  abductive 
reasoner  for  processing  a  student’s  explanation  was  about 
170  seconds,  a  painful  delay  for  a  real-time  dialogue  appli¬ 
cation  (Makatchev,  Jordan,  &  VanLehn  2004). 

ATMS-based  diagnosis 

The desire to reduce the amount of on-the-fly computation by
computing the proofs offline led us to adopt a powerful tool
from model-based diagnosis, an ATMS (Forbus & de Kleer 1993).
ATMSs have been used for tasks that are closer to the front end
of the NLP pipeline, such as parsers that perform reference
resolution (e.g. (Nishida et al. 1988)), but there are few
systems that utilize an ATMS at deeper levels of NL
understanding (e.g. (Zernik & Brown 1988)). Our implementation
of the ATMS includes a subset of the deductive closure of givens
and correct and buggy assumptions, so that each derived
proposition (a node) carries a list of assumptions (labels) that
were made while deriving it. The subset of the deductive closure
can be computed and checked for completeness off-line, thus
improving on-the-fly efficiency and facilitating knowledge
engineering.
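
The way nodes accumulate assumptions during off-line precomputation can be sketched with a minimal propositional label-propagation routine. This is a textbook-style simplification and not the system's ATMS: it has no nogood handling for inconsistent environments, atoms are plain strings, and the function names and the example propositions are assumptions made here.

```python
from itertools import product
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple

Env = FrozenSet[str]   # an environment: a set of assumptions


def minimize(envs: Set[Env]) -> Set[Env]:
    # Keep only environments that have no proper subset in the collection.
    return {e for e in envs if not any(other < e for other in envs)}


def compute_labels(givens: Iterable[str], assumptions: Iterable[str],
                   justifications: List[Tuple[List[str], str]]
                   ) -> Dict[str, Set[Env]]:
    # Givens hold in the empty environment; an assumption holds in itself.
    labels: Dict[str, Set[Env]] = {g: {frozenset()} for g in givens}
    labels.update({a: {frozenset({a})} for a in assumptions})
    changed = True
    while changed:
        changed = False
        for antecedents, consequent in justifications:
            if not all(a in labels for a in antecedents):
                continue
            # One environment per way of choosing an environment for each
            # antecedent: the union of the chosen environments.
            combos = {
                frozenset().union(*choice)
                for choice in product(*(labels[a] for a in antecedents))
            }
            new_label = minimize(labels.get(consequent, set()) | combos)
            if new_label != labels.get(consequent):
                labels[consequent] = new_label
                changed = True
    return labels


# Illustrative run with one correct and one buggy assumption.
LABELS = compute_labels(
    givens={"heavy-ball-has-greater-mass"},
    assumptions={"balls-near-earth", "bug-gravity-same-for-all"},
    justifications=[
        (["heavy-ball-has-greater-mass", "balls-near-earth"],
         "heavy-ball-has-greater-gravity"),
        (["bug-gravity-same-for-all"], "balls-have-same-gravity"),
    ],
)
```

Since the labels depend only on the problem givens, the assumptions and the rule base, they can be computed once per problem and stored, leaving only matching for run time.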

Completeness  and  correctness  analyzer 

All domain statements that are potentially required to be
recognized in the student’s explanation or utterances are
divided into principles and facts. The principles are versions
of general physics (and “buggy physics”) principles that are
either of a vector form (for example, “F=ma”) or of a
qualitative form (for example, “if total force is zero then
acceleration is zero”), while facts correspond to concrete
instantiations of the principles (for example, “since there is
no horizontal force on the ball its horizontal acceleration is
zero”) or to derived conclusions (for example, “the horizontal
acceleration of the ball is zero”). The nodes of the ATMS
correspond to facts derived from the problem givens, so the ATMS
can only provide an analysis of utterances about facts, but not
principles. Representations of input utterances are first
matched against a set of manually created representations of
relevant general principles, important facts and misconceptions.
If there is a match, the input utterance is said to contain an
explicit correct statement or an explicit error. Otherwise the
statement is matched against the ATMS facts to determine whether
it refers to an implicit correct statement or an implicit
misconception.

If some nodes of the ATMS match the representation of the input
utterance, they are analyzed for correctness by checking whether
their labels contain only environments (consistent sets of
assumptions that are sufficient to infer a node) with buggy
assumptions. When the label of a node contains no environment
that is free of buggy assumptions, the node can only be derived
using one of the buggy assumptions and therefore represents an
implicit misconception. These buggy assumptions are then
reported to the tutoring-system strategist for possible
remediation. Additionally, a neighborhood of radius N (in terms
of graph distance) of the matched nodes can be analyzed for
whether it contains any of the required facts, to get an
estimate of the proximity of a student’s utterance to a required
point (see footnote 1). A few examples of the utterance analysis
are presented in the next section.
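
The two checks described above, the bug-free-environment test and the radius-N neighborhood search, can be sketched as follows. This is an illustration under assumptions: labels are given as sets of frozensets of assumption names, the inference graph is an undirected adjacency dictionary, and the function names and the “unknown” outcome for an empty label are inventions of this example.

```python
from collections import deque
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple


def diagnose(label: Set[FrozenSet[str]], buggy: Set[str]) -> Tuple[str, Set[str]]:
    # A node is correct if at least one of its environments is bug-free;
    # otherwise every derivation relies on some buggy assumption.
    if not label:
        return "unknown", set()
    if any(not (env & buggy) for env in label):
        return "correct", set()
    blamed = set().union(*(env & buggy for env in label))
    return "misconception", blamed


def required_within(graph: Dict[str, List[str]], matched: Iterable[str],
                    required: Set[str], radius: int) -> Dict[str, int]:
    # Breadth-first search from the matched nodes, cut off at the radius.
    # Returns each required fact found, with its graph distance.
    dist = {node: 0 for node in matched}
    queue = deque(dist)
    while queue:
        node = queue.popleft()
        if dist[node] == radius:
            continue
        for neighbor in graph.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return {n: d for n, d in dist.items() if n in required}
```

The buggy assumptions returned by diagnose are what would be handed to the strategist, while the distances from required_within give a rough proximity estimate for a chosen N.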

Examples 

In  this  section  we  will  give  example  analyses  of  four  types 
of  utterances:  an  explicit  correct  statement  (a  fact  or  a  prin¬ 
ciple),  an  explicit  error,  an  implicit  correct  statement,  and  an 
implicit  misconception. 

Explicit  correct  statement  Consider  the  following  stu¬ 
dent  utterance  “Since  average  velocity  is  vf+vi/2,  the  balls 
will  hit  the  ground  at  the  same  time.”  The  sentence  gets  the 
following  FOPL  representation  by  the  system  (as  before,  we 
omit  the  representation  of  variable  sorts  for  brevity): 

(math-form mf1 avgv-is-vf-plus-vi-over-2)
(become b1 contact big-ball ?body2 detached attached ?t1 ?t2)
(become b2 contact small-ball ?body3 detached attached ?t3 ?t4)

The first atom of the representation is generated by the
equation identifier, described in (Makatchev et al. 2005). The
following two atoms represent that the small ball and the big
ball come in contact with some body at some time. Ideally,
variables ?body2 and ?body3 should both be bound to the constant
earth, and time points should be equal between these two
predicates, to show that the events of changing the contact
state from detached to attached happen at the same time.
However, because the conversion from NL to FOPL is imperfect, a
typical representation has underconstrained variables.

First, the analyzer is called to detect a direct match of the
utterance with stored principles, facts and misconceptions. As
it happens, the representation above contains two atoms that are
close to the stored representation

(become b3 contact big-ball earth detached attached ?t1 ?t2)
(become b4 contact small-ball earth detached attached ?t1 ?t2)

of the fact which is a correct answer to the problem, namely
“the balls hit the ground at the same time.” The matcher would
consider such a partial match a success. Although the first atom
of the input representation above is a part of the stored
representation

(math-form mf2 avgv-is-vf-plus-vi-over-2)
(acceleration a0 ?body0 ... constant ... ?t5 ?t6)

of the principle “If acceleration is constant, then average
velocity = (vf+vi)/2,” the matcher would consider the overlap
insufficient to call this a match.

1  The value of N depends upon the tutoring strategy selected.

After  comparing  the  input  representation  with  representa¬ 
tions  of  all  relevant  statements  the  analyzer  returns  the  re¬ 
sult:  The  utterance  matches  the  fact  “the  balls  hit  the  ground 
at  the  same  time.” 
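
The outcome of this example can be reproduced with a deliberately simplified matcher that scores how many atoms of a stored representation are covered by the input. It is only a sketch: the system's actual matching handles cross-references between atoms, whereas this version ignores atom identifiers and does not check that variable bindings are consistent across atoms. The function names are assumptions, and the acceleration atom is abbreviated because the paper elides some of its arguments.

```python
from typing import List, Tuple

AtomT = Tuple[str, ...]


def is_var(term: str) -> bool:
    return term.startswith("?")


def atoms_compatible(a: AtomT, b: AtomT) -> bool:
    # Same predicate name and arity; position 1 (the identifier) is skipped;
    # elsewhere a variable matches anything and constants must be equal.
    if a[0] != b[0] or len(a) != len(b):
        return False
    return all(x == y or is_var(x) or is_var(y) for x, y in zip(a[2:], b[2:]))


def coverage(input_atoms: List[AtomT], stored_atoms: List[AtomT]) -> float:
    # Fraction of stored atoms that some input atom is compatible with.
    matched = sum(
        1 for s in stored_atoms
        if any(atoms_compatible(i, s) for i in input_atoms)
    )
    return matched / len(stored_atoms)


UTTERANCE = [
    ("math-form", "mf1", "avgv-is-vf-plus-vi-over-2"),
    ("become", "b1", "contact", "big-ball", "?body2",
     "detached", "attached", "?t1", "?t2"),
    ("become", "b2", "contact", "small-ball", "?body3",
     "detached", "attached", "?t3", "?t4"),
]
FACT = [
    ("become", "b3", "contact", "big-ball", "earth",
     "detached", "attached", "?t1", "?t2"),
    ("become", "b4", "contact", "small-ball", "earth",
     "detached", "attached", "?t1", "?t2"),
]
PRINCIPLE = [
    ("math-form", "mf2", "avgv-is-vf-plus-vi-over-2"),
    # Abbreviated: the elided arguments of the original atom are left out.
    ("acceleration", "a0", "?body0", "constant", "?t5", "?t6"),
]
```

Under this scoring the fact is fully covered while only half of the principle is, so any acceptance threshold above one half (a hypothetical choice) accepts the fact and rejects the principle, in line with the analysis above.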

Explicit  error  The  student  utterance  “the  big  ball  would 
hit  the  ground  before  the  small  ball,”  being  an  anticipated 
error,  has  a  matching  hand-coded  representation 

(become b5 contact big-ball earth detached attached ?t1 ?t2)
(become b6 contact small-ball earth detached attached ?t3 ?t4)
(before ?t2 ?t4)

as  a  buggy  statement  that  would  be  matched  via  a  process 
similar  to  the  one  described  in  the  previous  section. 

Implicit correct statement  More sophisticated processing is
needed when the student utterance does not directly match any of
the stored representations. This can happen for two reasons:
first, the utterance is not a valid or relevant statement in the
context of the problem; second, the statement is relevant in the
context of the problem but is not considered important for
tutoring goals and therefore does not have a corresponding
hand-coded stored representation. Deploying an ATMS aims at
validating the second type of statements and alleviating the
respective knowledge-engineering bottleneck. Ideally, the ATMS
should have all facts inferable from the givens (deductive
closure). In reality, we settle for an incomplete ATMS to allow
for efficient real-time matching of input representations with
its nodes.

Consider the following utterance: “The large ball will have a
greater force due to gravity.” Although this is a correct fact
in the context of the problem, it is not deemed necessary for
the solution shown in Figure 3. Consequently, there is no stored
representation of this fact in the system. However, if the
system were unable to evaluate the correctness of such a
statement, it would not only miss the opportunity to provide
positive feedback to the student but, arguably more importantly,
if the statement turned out to be incorrect (as in the next
section), it would miss the student error.

This input statement is, however, inferable from the givens of
the problem, and therefore is part of the deductive closure of
the givens. If the ATMS is complete enough, the statement is
represented as a set of nodes of the ATMS. In this case, the
analyzer, after returning NIL as a result of the direct match of
the input utterance, would be called to match the input with the
ATMS and would find a set of matching nodes in the ATMS. The
nodes of the ATMS, however, contain not only facts inferable
from the givens via correct rules, but also facts inferable via
buggy rules and buggy assumptions. Therefore having a match with
the ATMS does not guarantee correctness of the statement right
away. Correctness can be easily checked by examining the
environments (sets of assumptions) for which the matching nodes
hold true. The statement “the large ball will have a greater
force due to gravity” may hold in multiple environments, some of
which may include buggy assumptions. However, since it can be
inferred from the givens only, it must also hold in an
environment that does not contain any buggy assumptions. The
existence of such a bug-free environment proves the correctness
of the statement.

Once the correctness of the utterance is confirmed, the chain of
inferences recovered from the ATMS can be used to guide the
student back toward the required solution to the problem.

Implicit misconception  Similar processing occurs when the
statement is an implicit misconception. Consider the utterance
“The balls will have the same force due to gravity.” As with the
example of the implicit correct statement above, this is not an
anticipated statement and thus does not have a hand-coded stored
representation. However, it indicates that the student has a
misconception that, ideally, should be remediated by the
tutoring system.

Again,  since  the  statement  does  not  produce  any  direct 
matches,  it  is  matched  against  the  ATMS.  The  ATMS  in¬ 
cludes  a  number  of  buggy  assumptions  and  their  conse¬ 
quences.  One  of  the  buggy  assumptions  is  that  the  student 
believes  that  the  force  of  gravity  is  the  same  for  all  objects. 
While  the  representation  of  this  assumption  is  close  to  the 
student  utterance,  it  does  not  result  in  a  direct  match.  In¬ 
stead,  the  student  utterance  matches  a  statement  inferred  us¬ 
ing  this  buggy  assumption. 

There are no bug-free environments for which the statement
“Gravitational force is the same for both balls” holds, which
suggests that the statement is wrong (this would be guaranteed,
had our ATMS contained the complete deductive closure of the
givens). In fact, one of the environments for which the
statement holds contains the buggy assumption “The force due to
gravity is the same for all objects”. Given the limited number
of anticipated buggy assumptions (which can nevertheless be used
to infer a large number of erroneous facts), each of them can
have a hand-coded remediation dialogue. Thus leveraging the
features of the ATMS helps to reduce not only the
knowledge-engineering effort required for building formal
representations of potentially relevant statements, but also the
effort required for hand-coding knowledge-intensive structures
in other parts of the system.

Preliminary  evaluation 

Figure 4: Average recall and precision of utterance
classification. The size of a group of entries is shown relative
to the size of the overall data set. Average processing time is
0.011 seconds per entry on a 1.8 GHz Pentium 4 machine with
2 GB of RAM.

The completeness and correctness analyzer has been deployed in
an evaluation of the full WHY2-ATLAS tutoring system. This
evaluation, however, did not use essential features of the ATMS.
Instead, we evaluated the performance of the direct matching
procedure. Figure 4 shows results of classifying 62 student
utterances for one physics problem with respect to 46 stored
statement representations using only direct matching. To
generate these results, the data were manually divided into 7
groups based on the quality of conversion of NL to FOPL, such
that group 7 consists only of perfectly formalized entries, and
for 1 ≤ n ≤ 6 group n includes the entries of group n + 1 and
additionally entries of somewhat lesser representation quality,
so that group 1 includes all the entries of the data set. The
flexibility of the matching algorithm allows classification even
of utterances that have mediocre representations, resulting in
70% average recall and 82.9% average precision for 56.5% of all
entries (group 4). However, large numbers of inadequately
represented utterances (at least 38.7% of entries that did not
make it into group 4) result in 53.2% average recall and 59.7%
average precision for the whole data set (group 1). These
results are still significantly better than those of the two
baseline classifiers, the better of which peaks at 22.2% average
recall and precision. The first baseline classifier always
assigns the single label that is dominant in the training set
(the average number of labels per entry of the training set is
1.36). The second baseline classifier independently and randomly
picks labels according to their distributions in the training
set. The most frequent label in the training set corresponds to
the answer to the problem. Since in the test set the answer
always appears as a separate utterance (sentence), recall and
precision rates of the first baseline classifier are the same.
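
For readers who want to reproduce this kind of measurement, the sketch below computes per-entry precision and recall averaged over a data set and builds the dominant-label baseline. It assumes one plausible reading of “average recall and precision” (a mean of per-entry scores, with an empty prediction scoring zero precision); the exact averaging used in the evaluation is not specified here, and the function names are hypothetical.

```python
from collections import Counter
from typing import List, Set, Tuple


def average_precision_recall(predicted: List[Set[str]],
                             gold: List[Set[str]]) -> Tuple[float, float]:
    # Mean over entries of |pred ∩ gold| / |pred| and |pred ∩ gold| / |gold|.
    precisions, recalls = [], []
    for pred, ref in zip(predicted, gold):
        hits = len(pred & ref)
        precisions.append(hits / len(pred) if pred else 0.0)
        recalls.append(hits / len(ref) if ref else 0.0)
    n = len(gold)
    return sum(precisions) / n, sum(recalls) / n


def majority_baseline(train_labels: List[Set[str]],
                      n_entries: int) -> List[Set[str]]:
    # Always predict the single most frequent label of the training set.
    counts = Counter(label for entry in train_labels for label in entry)
    top_label = counts.most_common(1)[0][0]
    return [{top_label} for _ in range(n_entries)]
```

When every test entry carries exactly one label and the baseline predicts exactly one label per entry, each entry's precision equals its recall, which is why the two rates coincide for the first baseline.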

Although the current evaluation did not involve matching against
the ATMS, we did evaluate the time required for such a match to
get a rough comparison with the earlier on-the-fly approach.
Matching a 12-atom input representation against a 128-node ATMS
that covers 55% of relevant problem facts takes around 30
seconds, a considerable improvement over the roughly 170 seconds
required on average by the abductive reasoner.

Conclusions  and  Future  Work 

Analyzing NL with a formal logical framework, while providing a
number of benefits, is a task with long-standing challenges. In
this paper we presented how a choice of KR and reasoning
procedures can help solve the problems of expressiveness,
improve performance, and reduce the knowledge-engineering
effort. We described two reasoning frameworks that were
implemented in the working system prototypes. Replacing an
on-the-fly abductive reasoning procedure with a precomputed
off-line ATMS has led to a significant reduction of processing
time. A more detailed evaluation of the ATMS for NL
understanding will be conducted in a future study.